Multilabel classification with random forest
RandomForestClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
make_multilabel_classification
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html#sklearn.datasets.make_multilabel_classification
code:python
>> from sklearn.datasets import make_multilabel_classification
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.model_selection import train_test_split
>> X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
>> X.shape, Y.shape
((12, 20), (12, 3)) # X is (n_samples, n_features); Y is (n_samples, n_outputs)
>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
>> X_train.shape, X_test.shape
((9, 20), (3, 20))
>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>> clf.fit(X_train, Y_train)
RandomForestClassifier(max_depth=2, random_state=0)
>> clf.predict(X_test) # return value has shape (n_samples, n_outputs)
array([[1, 1, 0],
       [1, 1, 0],
       [0, 1, 0]])
>> Y_test
array([[0, 0, 0],
       [0, 0, 0],
       [0, 1, 0]])
>> clf.score(X_test, Y_test) # subset accuracy: exactly 1 of the 3 samples matches on all labels (the third row)
0.3333333333333333
>> # A list of length 3 (n_outputs=3), ordered as in classes_
>> # Each element is an array (one per output) of shape (3, 2) = (n_samples, [nega, posi]); nega + posi = 1 per row
>> # predict thresholds at 0.5 (0 if the nega score > posi, 1 otherwise)
>> clf.predict_proba(X_test)
[array([[0.386     , 0.614     ],   # output 1: 1, 1, 0 (reading predict **column-wise**)
        [0.49766667, 0.50233333],
        [0.5455    , 0.4545    ]]),
 array([[0.28416667, 0.71583333],   # output 2: 1, 1, 1
        [0.37683333, 0.62316667],
        [0.1515    , 0.8485    ]]),
 array([[0.538     , 0.462     ],   # output 3: 0, 0, 0
        [0.56183333, 0.43816667],
        [0.68916667, 0.31083333]])]
>> Y_train
array([[0, 1, 0],
       [1, 1, 1],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 0],
       [1, 1, 1],
       [0, 1, 0],
       [1, 1, 0]])
>> clf.classes_
[array([0, 1]), array([0, 1]), array([0, 1])]
>> clf.n_outputs_
3
>> clf.n_features_
20
>> clf.feature_importances_
array([0.09609789, 0.04004824, 0.08110294, 0.07757968, 0.0658578 ,
0.03856713, 0.02071324, 0.0548202 , 0.04409759, 0.02824918,
0.05900911, 0.00996354, 0.02613438, 0.07684385, 0.01227006,
0.01527436, 0.09345899, 0.07768832, 0.03305281, 0.04917069])
>> len(clf.estimators_)
100
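The 0.333… score above can be sanity-checked: for a 2-D multilabel `Y`, `score` delegates to `accuracy_score`, which counts a sample as correct only when every one of its labels matches (subset / exact-match accuracy). A minimal check on the same toy setup:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

# score() delegates to accuracy_score; for multilabel Y this is subset accuracy:
# a row counts as correct only when ALL of its labels match
exact_match = np.all(Y_pred == Y_test, axis=1).mean()
assert np.isclose(clf.score(X_test, Y_test), exact_match)
```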
Handling predict_proba in the multilabel case
Motivation: we want to move the decision threshold away from 0.5
See the code below for how
Based on the implementation of predict (v0.23.2)
predict returns classes (integers), so a dtype is specified there
clf.classes_[k]: classes_ is a list whose elements are numpy arrays
take is used to pick the class at the index of the larger score
Example: [0.386, 0.614]
The index of the larger value is 1
clf.classes_[0] is array([0, 1]), so take retrieves the class with the larger score
clf.classes_[0].take([1]) -> array([1])
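The argmax-plus-take decoding described above can be sketched end to end (this mirrors the logic of the v0.23.2 predict implementation rather than quoting the library source verbatim):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)

proba = clf.predict_proba(X_test)  # list of n_outputs_ arrays, each (n_samples, 2)
n_samples = proba[0].shape[0]
pred = np.empty((n_samples, clf.n_outputs_), dtype=np.int64)
for k in range(clf.n_outputs_):
    # argmax gives the index of the larger score per row;
    # take maps that index back through classes_[k] to the class label
    pred[:, k] = clf.classes_[k].take(np.argmax(proba[k], axis=1), axis=0)

assert np.array_equal(pred, clf.predict(X_test))
```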
code:python
>> import numpy as np
>> proba = clf.predict_proba(X_test) # list of length n_outputs_; each element is (n_samples, n_classes)
>> n_samples = proba[0].shape[0]
>> positive_scores = np.empty((n_samples, clf.n_outputs_))
>> for k in range(clf.n_outputs_):
...     positive_scores[:, k] = proba[k][:, 1]
>> positive_scores
array([[0.614     , 0.71583333, 0.462     ],
       [0.50233333, 0.62316667, 0.43816667],
       [0.4545    , 0.8485    , 0.31083333]])
>> (positive_scores >= 0.5).astype(int) # equivalent to predict
array([[1, 1, 0],
       [1, 1, 0],
       [0, 1, 0]])
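With the positive scores stacked into a (n_samples, n_outputs) matrix, the original motivation, moving the threshold away from 0.5, is a single comparison. A sketch using `np.column_stack` instead of the explicit loop (note that a strict `> 0.5` matches predict exactly, since argmax breaks a 0.5/0.5 tie toward class 0):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)

# positive-class column of every output, stacked to (n_samples, n_outputs)
positive_scores = np.column_stack(
    [proba_k[:, 1] for proba_k in clf.predict_proba(X_test)]
)

default_pred = (positive_scores > 0.5).astype(int)   # reproduces predict()
stricter_pred = (positive_scores > 0.7).astype(int)  # fewer positive labels
assert np.array_equal(default_pred, clf.predict(X_test))
```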
Multilabel confusion matrix (MCM)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.multilabel_confusion_matrix.html#sklearn.metrics.multilabel_confusion_matrix
code:python
>> from sklearn.metrics import multilabel_confusion_matrix
>> multilabel_confusion_matrix(Y_test, clf.predict(X_test)) # class-wise
array([[[1, 2],   # [[tn, fp],   class 1: tn=1, fp=2
        [0, 0]],  #  [fn, tp]]
       [[0, 2],   # class 2: tp=1, fp=2
        [0, 1]],
       [[3, 0],   # class 3: tn=3
        [0, 0]]])
>> multilabel_confusion_matrix(Y_test, clf.predict(X_test), samplewise=True)
array([[[1, 2],   # sample 1: tn=1, fp=2
        [0, 0]],
       [[1, 2],   # sample 2: also tn=1, fp=2
        [0, 0]],
       [[2, 0],   # sample 3: tn=2, tp=1
        [0, 1]]])
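Each 2x2 block of the MCM exposes tn/fp/fn/tp directly, so per-class precision and recall can be read off without extra bookkeeping. A sketch, cross-checked against precision_score/recall_score (the `zero_division=0` parameter exists since scikit-learn 0.22):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (multilabel_confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=12, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

mcm = multilabel_confusion_matrix(Y_test, Y_pred)  # shape (n_classes, 2, 2)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
fn, tp = mcm[:, 1, 0], mcm[:, 1, 1]

# per-class precision/recall, with 0/0 treated as 0
with np.errstate(invalid="ignore", divide="ignore"):
    precision = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
    recall = np.where(tp + fn > 0, tp / (tp + fn), 0.0)
```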